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Abstract 

Transcriptome experiments are peiformed to assess protein abundance through mRNA expression analysis. Expression 
levels of genes vary depending on the experimental conditions and the cell response. Transcriptome data must be diverse 
and yet comparable in reference to stably expressed genes, even if they are generated from different experiments on the 
same biological context from various laboratories. In this study, expression patterns of 9090 microarray samples grouped 
into 381 NCBI-GEO datasets were investigated to identify novel candidate reference genes using randomizations and 
Receiver Operating Characteristic (ROC) curves. The analysis demonstrated that cell type specific reference gene sets display 
less variability than a united set for all tissues. Therefore, constitutively and stably expressed, origin specific novel reference 
gene sets were identified based on their coefficient of variation and percentage of occurrence in all GEO datasets, which 
were classified using Medical Subject Headings (MeSH). A large number of MeSH grouped reference gene lists are presented 
as novel tissue specific reference gene lists. The most commonly observed 17 genes in these sets were compared for their 
expression in 8 hepatocellular, 5 breast and 3 colon carcinoma cells by RT-qPCR to verify tissue specificity. Indeed, 
commonly used housekeeping genes GAPDH, Actin and EEF2 had tissue specific variations, whereas several ribosomal genes 
were among the most stably expressed genes in vitro. Our results confirm that two or more reference genes should be used 
in combination for differential expression analysis of large-scale data obtained from microarray or next generation 
sequencing studies. Therefore context dependent reference gene sets, as presented in this study, are required for 
normalization of expression data from diverse technological backgrounds. 
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Introduction 

During the last decade, there has been remarkable progress in 
the identification of human cells' transcriptome blueprint through 
small or large-scale quantitative gene expression studies. Since the 
gene expression is the major determinant of the protein 
abundance in the cell, transcriptome analysis experiments have 
been widely applied to reveal the molecular mechanisms of various 
ceUular conditions. Depending on the cell fate, expression levels of 
genes vary. Although it seems straightforward to assess these 
variations through Real Time quantitative PGR or microarrays, 
there is a continuing and confusing debate on what basis these 
variations should be considered as deviations from normal 
physiology. Therefore, there is a need for compilation and 
comprehensive analysis of a gene of interest across several 
experiments from different sources. A gene's expression data must 
be scaled in a comparable platform. Various normalization 
methods are available to scale these data within the same 
experiment, yet it becomes problematic to compare arrays of 
different sources without using references. 

Although most genes show variable expression depending on 
ceUular context, tissue of origin or treatment conditions, some 



genes are constitutively expressed in all cells in all conditions. 
These constitutively expressed genes are required for the 
maintenance of the basal ceUular functions such as metabolism, 
gene expression, protein synthesis and cell signaling [1,2]. These 
genes, called housekeeping genes, are generally assumed to have 
expression levels unaffected by tissue of origin or experimental 
condition. Therefore, they are widely used as reference genes for 
normalization of expression data. However, recent studies 
indicated that several widely used housekeeping genes have 
altered expressions under different experimental conditions [3- 
10]. Most of these studies focused on finding appropriate genes for 
normalization in individual cancer types. Yet, even the most 
commonly used reference genes (ACTB, GAPDH, TBP) were 
differentially expressed in different pathological stages of hepato- 
cellular carcinoma [7,11]. These findings bring forward the need 
to process high-throughput data in order to determine a global list 
of constitutively and invariably expressed genes that can be used as 
reference genes [12-14]. For example, Hruz T. et al. measured the 
standard deviation of gene expression across large sets of 
Afiymetrix arrays of human, mouse and Arabidopsis from the 
Genevestigator database and developed an online tool, Ref-Genes, 
that can be used to search for genes with minimal standard 
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Gene Symbol Name Ratio Percentile Rank 

Mean Median 



EEF2 


(cukaryotic translation GlonQation factor 2) 


0.979 


95.2 


98 


RPS10 




0.976 


96.5 


99 


B2M 


(b6ta-2-micro9lobuiin) 


0.975 


96.4 


99 


CFL1 


I iju mil ] 1 1 1 lUf 1 1 1 lUouic J j 


0.974 


94.9 


99 


UBC 


(ubicjuitin C) 


0.974 


96.4 


98.8 


RPL27 


^[lUUaUllldl jJIUlClll f 


0.973 


94.5 


99 


RPL3 


(ribosomal protsin t-3) 


0.973 


95.9 


99 


RPL7 


i riKriQnmal nrotoin 1 7 \ 

^riLfUoUillcll pi ULCII 1 L_ / ^ 


0.972 


96.1 


99 


ACTG1 


^dULiiii ^aiiiLiio 1 f 


0.971 


95.3 


99 


C0PS6 


Iv^Wr C7 LjiJI loU LU LI V Ul lUlUI Mwf Ul lUUt^l IIU IIUIIIUILjy oUUUIIIL U t/Ai d UlUUU^io 1 J 


0.971 


86.6 


88 


RPS17 


^rihoQnmal nrntfiin ^17^ 

^1 lUi-^DUl 1 Idl pi ULOII I \j i 1 f 


0.971 


96.3 


99 


RPL38 


^IIUUoUIIICII pi ULCII 1 LOO J 


0.97 


94 


97.3 


ATP5I 


(ATP synthas6i H+ transportiriQi mitochondrial FO complsxi subunit E) 


0.969 


92.4 


94.5 


RPS13 


\l lUUoUl 1 lal pi ULt:ll 1 O 1 


0.969 


96.2 


99 


NACA 


(nascGnt-polypsptids-associatsd complsx alpha polypsptlds) 


0.968 


95.6 


98.5 


PSMC1 


^pi UlCaoUl 1 le ^prUoUr MC, 1 1 IdOl Updll I f ^QO oUUUI II L, Ml ~abc, 1 } 


0.968 


92.3 


95 


RPL9 


(ribosomal protein L9) 


0.968 


93.8 


99 


TPT1 


(tumor protGin; translationally-controlled 1) 


0.968 


96.1 


99 


0AZ1 


lui 1 iiL[ ill ic \Jct_fdi LJUA y idoc diiLiz.yiiJc 1 1 


0.967 


93.5 


98.5 


RPL22 


^MUUbUllldl piUlClll 


0.967 


94.8 


97.2 


RPS16 


(ribosomal protsin SI 6) 


0.967 


95.9 


99 


UBB 


(ubic|uitin B) 


0.967 


93.7 


97 


C0X7C 


f r'\/triryyrr\moi r oviHaco cr ihiinif \/l lf\ 

^t^y ILfUl II Ul MC \j LIAlUdoC OUUUIIIL Vlll-'^ 


0.965 


92.6 


95.7 


H3F3A 


^no nibtoriG, rdiiiiiy orK) 


0.965 


95.5 


98 


RPL13 


(ribosomal protsin LI 3) 


0.965 


92.8 


98.8 


RPL13A 


(ribosomal protsin L13a) 


0.965 


96 


99 


RPN2 


^riKr^nhnrin \\\ 
^1 luup 1 lUI 1 1 1 M ^ 


0.965 


91.5 


95 


RPS27A 


^1 ILtUoUl 1 Idl pi ULCll 1 eX) 


0.964 


94 


98 


TUBB 


ILUUUIIM, UcLdJ 


0.964 


93 


96 


NDUFS5 


(NADH dshydroQsnase (ubicjuinons) Fe-S protsin 5, 15kDa (NADH-coenzyme Q rsductass)) 


0.963 


92.7 


96 


FAU 


^Pinl<pl-Ricl<ic-R*sill\/ mi irinci carpnma \/iri ic /PRR-^^l i ihini litm iqIv/ fiYnrf^QCpH ffnv Hpri\/pH V rih/^cnmal nrntPin ^"^0^ 

^ n M IrVCI DlOrVIo r\C Illy IIIUMIJC odi \ 1 Id V II Uo \ n D r\ 1 VI UO v ) UUIL^UIIL/Uoiy OApiOOOCU ^l ^JA UCI I VCU f, II UVJoLJI 1 Idl p 1 U lt>JI 1 <jo\j f 


0.962 


93.7 


98 


RPL30 


niLJUoUllldl pi LJLCII 1 L_OU J 


0.962 


94.4 


99 


RPS11 


/ rihoGnmal nrotoin ^ 1 1 ^ 

^1 lUUoUl 1 Idl pi ULCII 1 O 1 1 1 


0.962 


92.2 


99 


RPS3A 


(ribosomal protein S3A) 


0.962 


96.1 


99 


S0D1 


(superoxide dismutase 1; soluble (amyotrophic lateral sclerosis 1 (adult))) 


0.962 


93.6 


97 


COX7A2 


(cytochrome c oxidase subunit Vila polypeptide 2 (liver)) 


0.961 


93.6 


97 


NCL 


(nucleolin) 


0.961 


93.3 


97 


RPL21 


(ribosomal protein L21) 


0.961 


94.2 


99 


SRP14 


(signal recognition particle 14kDa (homologous Alu RNA binding protein)) 


0.961 


92.7 


97 


RPS6 


(ribosomal protein S6) 


0.96 


95.2 


99 


EEF1B2 


(eukaryotic translation elongation factor 1 beta 2) 


0.959 


94.6 


98 



Figure 1. Screenshot of the first 40 lines from reference gene lists. Connpiete set of MeSH classified reference gene lists from 9090 array 
samples are given in hyperlinked Supporting Information SI spreadsheet. 
doi:1 0.1 371 /journal.pone.0093341 .gOOl 



deviation across a chosen set of arrays [15,16]. They concluded 
that no genes are universally stable, but a subset of stable genes 
with minimal variance exists for each biological context that can 
be used for the normalization of RT-qPCR data. 

Although publicly available microarray or next generation 
sequencing (NGS) experiments were used to generate lists of 
candidate reference genes, novel statistical approaches for testing 
accuracy of a reference gene are stiU needed. Herein, we aimed to 
confirm the reliability of available housekeeping gene sets using 
randomization as well as to determine other invariably expressed 
gene sets based on Receiver Operating Characteristic (ROC) 
curves for classifiers under large number of experimental 
conditions and across a wide panel of tissue types. 

Our method provides reference gene hsts for global and cell- 
type specific normalization of transcriptome data. Gene lists are 
scored based on their expression stability, and classified according 
to the Medical Subject Headings (MeSH) associated with the 
transcriptome study that was published and indexed by National 
Library of Medicine. Gene lists are provided in the supporting 
dataset (Supporting Information SI). RT-qPCR assessment of 



selected reference genes is also provided for various tissue-specific 
cancer cell lines in vitro. 

Results and Discussion 

Development of a methodology to identify consistently 
stable genes 

Housekeeping/reference genes should exhibit relatively con- 
stant expression levels when compared to non-housekeeping genes. 
To identify the consistency of the so-called housekeeping genes 
across large-scale experiment sets, the gene expression data were 
downloaded from the NCBI Gene Expression Omnibus (NCBI- 
GEO) database [17], and aU spot data were extracted along with 
their associated metadata from the platform files. The data set 
approximately contained 142 million oligonucleotide microarray 
spots from 9090 microarray samples, which were grouped into 381 
GEO datasets. Percentile-ranking method was applied indepen- 
dently on the global mean normalized data within each sample in 
each GEO dataset. This process provided a rank value for each 
gene within a sample. Therefore, the rank measure was 
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0.4 0.6 
Ratio,=o5 



Ratio.^ 




0.4 0.6 

RatiOt=o,05 



0.4 0.6 

Ratio,=o.oi 



Figure 2. Graph of ratio of tKie number of sets in which a gene has a coefficient of variation (CV) less than a threshold (t), to the 
number of sets in which the gene is observed. Graphs were plotted for CV value thresholds t = 0.5 t = 0.1 , t = 0.05 and t = 0.01 . Percentage of 
occurrence (PO) Is at least 50% of the total sets. The y-axis indicates the number of genes having a ratio greater than the ratio value at the 
corresponding x-axis. This function is described in the methods section, as x-axis being r and y-axis being fpo(r). The black curve represents 
housekeeping genes while curves with grey colors show 5 random sets of genes excluding the housekeeping genes. Random sets of genes have the 
same mean rank distribution as of those housekeeping genes. 
doi:10.1371/journal.pone.0093341.g002 



comparable across experiments and platforms, allowing the 
analysis of the behavior of a gene globally across GEO datasets. 
Using ranks of genes is a standard method as also employed in 
Quantile Normalization, which is a common microarray normal- 
ization technique [18]. 

The average changes in the rank of each gene in each GEO 
dataset (GDS) was computed based on the ratio of the standard 
deviation to the mean and termed as its coefficient of variation 
value (CV) (see Methods). CV, as a standardized measure of 
sample variability, help suggest candidate reference genes whose 
expression is stable and unaffected by the experimental condition 
since it has been successfully apphed in identification of reference 
genes in multiple studies with varying thresholds [19-22]. Most 
genes had coefficient of variation values only for a subset of the 
available GDSs. Therefore, for each gene G, and a predefined CV 
threshold i, the ratio of GEO datasets, where CV is less than 
Ratiot(G;), is calculated as the measure of stable expression. 
Ratio,(G;) is the ratio of GEO datasets, in which a gene exhibits a 
coefficient of variation less than t. The RatiOt(G,) value, by itself, is 
not a sufficient measure to identify statistically significant reference 
genes. A gene that has a small enough CV value can get a perfect 
ratio even if it is observed in only a single GEO dataset. Therefore, 



a new parameter, PO(Gi), calculated as the percentage of datasets 
that each gene has been observed in at least once, was used to 
adjust RatiOt(G,) parameter (Methods). Ratio,(G,) in the context of 
PO(Gi) allowed for accurate normalization of a large set of 
microarray data, since it takes into account the information about 
the differences in probe /clone composition of the arrays. We 
assessed the utility of these measures by implementing a simple 
threshold based classifier and computing the sensitivity and 
specificity of this classifier using a published reference gene set 
as the ground truth (Methods). Two sets of reference genes were 
generated. First list contains reference gene lists with a CV 
threshold of 0.12 and with various sensitivity values while the 
second list is built with sensitivity of 0.5 for a range of CV 
threshold values. Figure 1 shows the first 40 genes from 342 
reference genes with a CV of at most 0.12, sensitivity equal or over 
0.5, specificity of 0.97 and minimum percentage of occurrence of 
0.75 in all 9090 array samples from 381 GEO datasets (See 
Reference Gene Lists in Supporting Information SI for the 
complete list). The high specificity shows that the CV measure 
coupled with percentage of occurrence is an accurate measure for 
identification of reference gene sets. 
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0.4 0.6 

RatiOt=o,o5 




Figure 3. Graph of ratio of tiie number of sets in which gene has coefficient of variation less than 0.05 to the number of sets in 
which the gene is observed. Gene is observed at least a-) 75%, b-) 50%, o) 25% and d-) 5% of the total sets. The y axis indicates the number of 
genes having a ratio greater than ratio value at the corresponding x axis. The curve with red color represents housekeeping genes while curves with 
other colors shows 5 random sets of genes excluding the housekeeping genes. Random sets of genes have the same mean rank distribution as of 
those housekeeping genes. 
doi:10.1371/journal.pone.0093341.g003 



Housekeeping/Reference genes should exhibit relatively con- 
stant expression levels and their average rank change should be 
lower than that of remaining genes. Hence, we assumed that the 
candidate reference genes should have lower CV and higher 
Ratio,(G;) than that of randomly selected genes. In order to 
compare the expression behaviors of reference genes to that of 
random genes, we first analyzed the largest previously reported 
housekeeping datasets [1,23]. There was high variation between 
genes in the first dataset, in parallel with the authors' observations. 
Therefore, the dataset provided by Eisenberg et al. was used in our 
analysis. The 566 housekeeping genes in this dataset were 
compared to five different randomly selected sets of non- 
housekeeping genes having the same mean rank distribution as 
that of the housekeeping gene set. 

Normalized gene expression values were analyzed for the CV 
thresholds Z=0.5, Z=0.1, i=0.05, and /=0.01 and minimum 
percentage of occurrences PO = 75%, 50%, 25% and 5%. When 
coefficient of variation, CV, was less than 0.5 (i=0.5), percentUe- 
ranked GEO datasets showed high RatiO( values for nearly all of 
the analyzed genes (housekeeping or not), for aU PO values. At 
lower CV thresholds, t=Q.\, i=0.05 and i=0.01, housekeeping 
genes had significantiy higher Ratio; values than those of random 
gene sets for aU PO values. The randomization approach allowed 



us to test an optimum range of / and PO values that can 
discriminate between reference and non-reference genes. Graphs 
of Ratio; at PO = 50% with four different CV thresholds (Figure 2) 
and graphs of RatiO( at CV /=0.05 with four different PO 
thresholds were plotted (Figure 3). These graphs showed that 
measures could accurately distinguish previously identified refer- 
ence genes from the randomly selected ones. 

The difference in the ratio distribution of housekeeping genes 
compared to that of the randomly selected non-housekeeping gene 
sets was statistically significant, as shown by Kolmogorov-Simirnov 
tests (Table SI in Supporting Information S2, p<0.0001). The 
random gene sets, excluding the housekeeping genes, did not show 
any significant ratio distribution difference, as shown by using 
Bonferroni adjusted Kolmogorov-Simirnov tests (Table S2 in 
Supporting Information S2). These observations proved that the 
expression of the tested housekeeping genes was less variable 
across different experiment sets compared to that of randomly 
selected gene sets. 

In addition, we assessed the sensitivity and specificity of a simple 
threshold based classifier using the CV threshold using the 
published housekeeping gene set as the ground truth. The receiver 
operator characteristic (ROC) curve in Figure 4 shows that more 
than half of the published reference genes can be identified with a 
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Tissues (MeSH A10) 



0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 

Epithelial Cells (MeSH A11. 436) 



0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 



Immune System (MeSH A1 5.382) 




0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 

Cells, cultured (MeSH A1 1.251) 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 

Neoplasms (MeSH C04) 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1 -Specificity 



Figure 4. Receiver Operator CKiaracteristic (ROC) curve of the simple threshold based classifier. The receiver operator characteristic 
(ROC) curve of a simpie threshold classifier over all datasets and some MeSH categories. The housekeeping gene set by Eisenberg et al. [22] is used as 
the ground truth. The simple threshold classifier classifies all the genes with CV values below a threshold as housekeeping genes. By using different 
CV thresholds the stringency of the classifier can be varied and the ROC curve can be plotted accordingly. Sensitivity is the ratio of correctly classified 
ground truth genes over all ground truth genes and specificity is the ratio of correctly identified non-housekeeping genes over all non-housekeeping 
genes. Complete set of MeSH classified RO-curves are given in hyperlinked Supporting Information S3 spreadsheet. 
doi:1 0.1 371/journal.pone.0093341 .g004 



specificity of 0.97. Similar analy.ses performed on datasets grouped 
by MeSH are available in Supporting Information S3. 

Identification of novel reference genes 

Our primary goal in this study was to define a novel reference 
gene set that can be used for both global and cell type-specific 
normalization of expression experiments. For this purpose, a 
classifier that can be used to identrly novel reference genes was 
buUt. Based on our analysis with the known 566-housekeeping 
gene set, coefficient of variation (CV) measure was set as the 
variable for building the classifier while minimum percentage of 
occurrence (PO) and Ratio, values were fixed at 75% and 0.90 
respectively. The accuracy of this classifier in predicting reference 
genes was assessed in comparison with the previously reported 566 
housekeeping gene set [23]. Among the candidate reference genes 
that were identified by our classifier at each CV threshold, the 
known 566 housekeeping genes were regarded as true positives 
(TP) and the other genes were regarded as false positives (FP) to 
plot a receiver-operating characteristic (ROC) curve. In the 
supporting gene lists, the sensitivity was set to 0.5, implying that 
half of the identified reference genes are the known housekeeping 
genes. The ROC curve of our classifier for CV values ranging 



from 0.01 to 10 showed its effectiveness in finding true positives 
(sensitivity) (Figure 4). Curves for the overall (Figure 4A) and 
specific reference gene sets are available in the supporting 
hyperlinked dataset (Supporting Information S3). 

According to a classic ROC curve, a good classifier should 
capture most of the known housekeeping genes while providing a 
relatively small number of false positives. However, in this 
particular case, the false positives could be the newly identified 
candidate reference genes. Therefore, their CV and Ratiot values 
should still be considered for their potential as a reference gene. 
CV, Ratio and Percentile Rank values are provided for each gene 
in the supporting hyperlinked dataset. The global reference genes 
were given in this list under the category of All with CV of 0.12, 
sensitivity equal or over 0.5, specificity of 0.97 and minimum 
percentage of occurrence of 0.75 (Figure 2 A and Supporting 
Information SI). 

In order to determine origin- and cell type-specific reference 
genes, the GEO sets were classified according to the Medical 
Subject Headings (MeSH) associated with their experimental data, 
published and indexed by National Library of Medicine. Of the 
381 GEO datasets analyzed in this study, 341 were associated with 
272 different medical publications and 264 of these publications 
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Table 1. MeSH groups 


and number of NCBI-GEO data sets in each group. 






MeSH Tree Number 


MeSH Heading 


Number of Sets 


ALL 


ALL 


381 


AlO 


Tissues 


77 


Al 0.272 


Epithelium 


16 


A 10.690 


Muscles 


46 


All 


Cells 


223 


All. 118 


Blood Cells 


47 


Al 1.148 


Bone Marrow Cells 


14 


Al 1.251 


Cells. Cultured 


157 


Al 1.284 


Cellular Structures 


40 


Al 1.329 


Connective Tissue Cells 


34 


Al 1.436 


Epithelial Cells 


48 


Al 1.627 


Myeloid Cells 


17 


Al 1.733 


Phagocytes 


14 


Al 1.872 


Stem Cells 


22 


A15 


Hemic and Immune Systems 


74 


A15.145 


Blood 


51 


Al 5.378 


Hematopoietic System 


14 


Al 5.382 


Immune System 


60 


C04 


Neoplasms 


108 


C04.557 


Neoplasms by Histologic Type 


68 


C04.588 


Neoplasms by Site 


68 


C04.697 


Neoplastic Processes 


16 



doi:l 0.1 371/journal.pone.0093341 .tOOl 



were associated with a total of 5754 MeSH terms [24]. These gene 
sets were grouped into three anatomy (tissues, cells, hemic and 
immune systems), and one disease (neoplasms) based MeSH 
categories (Table 1). GDS identification numbers associated with 
each MeSH are given in Supporting Information S 1 . Two lists of 
reference gene sets based on CV and sensitivity, were provided for 
each MeSH category. First list was constructed based on a fixed 
threshold for CV<0.12 and the second on a frxed threshold of 
sensitivity>0.50. Both gene sets had a fixed PO threshold of 75%. 
The reference gene lists, which include CV, PO, specificity and 
sensitivity values for each MeSH category, are provided as 
supporting hyperlinked dataset (Supporting Information SI). 

Experimental validation of selected reference genes in 
different cancer cells 

Among the large panel of identified reference genes, 1 7 genes 
were selected for experimental validation (Table 2). Expression 
levels of these reference genes {AARS, ACTB, CFLl, EEF2, 
GAPDH, GSTOl, H2AFZ HBXIP, RPL30, RPL41, RPL7, RPJV2, 
RPSIO, RPS17, RPS3A, SODl, TPTl) were assessed by RT-qPCR 
in 16 different cell lines consisting of 8 Hepatocellular Carcinoma 
(HCC) (HepG2, FOCUS, Mahlavu, Hep3B, Hep3B-TR, Huh7, 
SkHepl, and PLC), 5 Breast Cancer (MDA-MB453, HCC 1937, 
BT20, T47D and CAMA-I), and 3 Colon Cancer (HCT116, 
HT29, and SW620) cell lines. Housekeeping/ reference genes are 
expected to have high expression and low variability in expression 
levels between cells. Therefore, in RT-qPCR amplification, they 
should have low threshold cycle (Cq <30) and low standard 
deviation. All of the tested reference genes met this requirement 
for Cq values and standard deviations (Figure 5). The best 



candidate reference genes appeared at the center of the graphs for 
each group (Figure 5 A-C). In order to emphasize the stability of 
these genes, a well-expressed non-housekeeping gene, RECK, was 
included into the CV analysis. It was not included in the analysis 
with NormFinder and geNorm due to its high variability (CV 
around 1.0). Genes were ranked by their stability based on their 
CV. Comparison of the expression values by analysis of variance 
(ANOVA) identified RPL30 as the most stable gene with lowest 
variance within and between the Liver, Breast and Colon 
Carcinoma cell line groups, followed by RPL41, RPSIO and 
CFLl. The variance within each of these three groups was low for 
RPS3A, RPS17, RPL7, ACTB, H2AFZ, and ifflZZP reference gene 
expression. However, the variance between the three groups was 
relatively high. This implied that these reference genes are more 
suitable for normalization of cell lines from single tissue-origin 
than cell lines with diflFerent tissue-origins. GSTOl, TPTl, RPM2, 
SODl showed the highest variability. 

Real-time quantitative PCR gene expression stability was 
further determined using geNorm and NormFinder software 
[3,25]. GeNorm is a pairwise comparison-based model. For each 
gene, it calculates an expression stability value based on the 
average pairwise variation between all tested genes. The genes are 
ranked according to their expression stability through stepwise 
exclusion of the least stable gene (highest stability value). 
NormFinder is a model-based approach that estimates the 
variation between sample subgroups, such as liver, breast and 
colon cancer cell lines, as well as the overall expression variation of 
the tested genes. Unlike geNorm, the resulting stability value and 
the stability rank order changes in NormFinder depending on the 
input genes. Therefore, geNorm and NormFinder stability analysis 
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Figure 5. Graphs of coefficient of variations and relative expression levels of 17 reference genes in RT-qPCR. Coefficient of Variation 
(CV) was calculated based on the relative expression (efficiency"'^'"'') of each housekeeping gene in (A) Liver, (B) Breast, (C) Colon and (D) All cancer 
cell lines. 
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was performed with and without the least stable genes EEF2, 
GSTOl, RPJV2 and TTTl, which have a standard deviation value 
above the mean standard deviation (stdev= 1.42). The selected 
reference genes were ranked according to their overall stability 
values across 16 different cell lines, determined by geNorm and 
NormFinder (Table 3, Figure 6, Table S3 in Supporting 
Information S2). The ranking of these genes was similar for both 
methods. Ribosomal protein genes RPSIO, RPL41, RPL30, and 
RPS3A were the most stable genes among all 16 cell lines 
according to geNorm and NormFinder respectively (Table 3, 
Tables S3-S4 in Supporting Information S2). Two traditional 
reference genes, ACTB and GAPDH, were less stable than the 
ribosomal genes and ranked lower in the stability rank list. CFLl 
and HBXIP were the most stable genes after the ribosomal protein 
genes in general (Figure 7A and Table 3). While the commonly 
used reference gene GAPDH was one of the stable genes within 
liver cancer cell lines, it was among the least stable genes within 
breast and colon cancer cell lines. It had a high variance in terms 
of stability value and expression value between the three types of 
carcinoma cell lines investigated. This implied that GAPDH could 
be used as a reference gene when comparing liver cancer cell lines, 
but not breast and colon cancer cell lines. ACTB was more stable 
than GAPDH in Breast and Colon Carcinoma cell lines and had a 
stability value similar to GAPDH in HCC cell lines. Moreover, 
even though reference genes were known to avoid regulation by 
miRNAs, recent findings showed that GAPDH and ACTB are 
direct targets of miR-644a [26,27]. Besides, several pseudogenes of 



GAPDH and ACTB were revealed, making them less reliable to be 
used as reference genes [28]. 

Stability within liver, breast and colon cancer cell lines was also 
analyzed separately. The ribosomal genes RPL30, RPL41, RPL7, 
RPSIO, and RPS3A were stable within each cell line group 
(Figure 7B-D, Table 3, Tables S3, S4 in Supporting Information 
S2). HBXIP, CFLl, and GAPDH were among the most stable genes 
together with the ribosomal protein genes in 8 HCC cell lines 
analyzed. HBXIP was the third most stable gene in breast cancer 
cell lines according to geNorm, but ninth according to 
NormFinder ranking. H2AFJ^ and HBXIP were ranked among 
the most stable genes together with RPSIO, RPL41, and RPS3A in 
colon cancer cell lines. The difference in the gene stability ranking 
order between the two softwares is due to the different 
methodologies. geNorm is based on the comparison of expression 
similarity of the tested genes. This may cause exclusion of a 
candidate reference gene with a relatively stable expression in 
early steps of the ranking procedure, if all other genes in the list 
have similar expression profiles. Another disadvantage of this 
approach may be a bias towards co-regulated genes, since these 
genes will have similar expression profiles and hence may be 
ranked top in the stability list, regardless of their expression 
stability. NormFinder does not have such a bias, since expression 
stabihty of each gene is determined independent of the rest of the 
genes. However, since NormFinder ranking is based on the 
variation between groups, when comparing groups like liver, 
breast, and colon carcinoma cell lines, if the variation within group 
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Figure 6. Stability analysis of reference genes in RT-qPCR based on CV, geNorm and NormFlnder. Genes ranked by stability based on (A) 
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respectively. 
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is high, tlien tlie variation between group.s may considered as 
lower, leading to false positive results. 

In order to determine the optimal number of reference genes for 
accurate normalization, pairwise variation Vn/„+i analysis was 
performed using the geNorm software. Taking 0.15 as a cut-off 
value as proposed, the use of RPL41 and EPS 10 together as 
reference genes were enough for accurate normalization (V2/3 
value = 0. 1 34) when comparing Liver, Breast and Colon Carci- 
noma cell lines. An even more accurate normalization can be 
achieved ii RPL7 and RPS3A axe also included as reference genes. 
(V3/4 value = 0.091 and V4/5 value = 0.067). 

Conclusions 

Even the most frequently used reference genes are subject to 
differential regulation under specific treatments or between 
different cell lines or tissues. Therefore, new reference gene sets 
should be determined instead of using traditional housekeeping/ 
reference genes that are themselves prone to differential regula- 
tion. The use of two or more housekeeping genes for normaliza- 
tion can improve the reliability of normalization [5,9,29]. The 
largest meta-analysis for reference gene identification up-to-date 
compiled 1431 samples from 104 microarray data sets classified 



into 4 physiological states with 1 3 organ/tissue types and identified 
reference gene candidates mostly associated with transcription, 
RNA processing and translation [30]. Hruz et al. 2011 also 
performed a large scale meta-analysis across multiple species and 
sources using Genevestigator database [16]. They have used ranks 
of standard deviations using mouse Affymetrrx datasets to support 
context specificity of reference gene sets. Our study also is 
comprehensive containing 142 million oligonucleotide microarray 
spots from 9090 microarray samples, grouped into 381 GEO 
datasets from multiple platforms. We also provide a novel 
methodology based on randomizations and ROC that allows 
testing the specificity and sensitivity of classifying a gene as 
reference or non-reference. Previously, a meta-analysis of 13,629 
human gene array samples from GEO database identified 
candidate housekeeping genes, including RPS13, RPL27, RPS20 
and OA^l. For each gene the coefficient of variation (CV) of its 
expression and the maximum fold change was calculated to 
identify genes with the minor variation in expression [19]. 
Recently, Eisenberg et. al. published an updated new housekeeping 
gene list based on analysis of RNA-seq data [31]. However, MeSH 
classification used in the present study has not been applied to 
reference gene set combinations previously. In this study, we 
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Figure 7. Stability analysis of reference genes in tissue-specific cell lines in RT-qPCR. Stability analysis In (A) Liver, (6) Breast, (C) Colon and 
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determined novel reference gene sets that can be used for both 
global and MeSH category-based normalization of gene expres- 
sion data. Furthermore, we validated stability of known and novel 
reference genes obtained from meta-analysis with cancer cell line 
qPCR studies. 

Our reference gene lists were dominated by the ribosomal 
protein genes and genes that are involved in maintenance of basal 
cellular activities such as translatioLi aiid metabolism. Especially, 
RPL30 and RPL41 are good reference genes for comparing all cell 
lines regardless of their origin. Previously, 45 1 housekeeping genes 
that were expressed in all of the 1 9 distinct normal human tissue 
types studied, were shown to have variable expression levels 
among different tissues and ribosomal genes, which were among 
the most stable genes, were suggested as reference genes suitable 
for normalization purposes [1]. A more recent study however 
suggested that ribosomal protein genes, which display stable 
expression in meta-analysis, indeed exhibit variation in mRNA 
expression in a tissue-dependent manner [32]. These findings once 
more emphasize that finding a common reference gene, ribosomal 
or not, is not possible. The genes located at the top the reference 
gene set lists display the highest confidence based on CV and PO 
values therefore those genes are less likely to be wrong. However 
every computational analysis might have false positive results for 
this reason, users should select their reference gCLie of iiiterest and 
experimentally validate the stability of the expression of that 
reference gene under their experimental conditions. Furthermore 
more than 2 reference genes per experiment would be a better 
measure for precision. Computational calculations present a road 
map to guide the experimentalists. The data sets we used to build 
MeSH dependent reference gene lists originated from various 
sources which aimed to identify diflFerentiaUy expressed gene sets 
under specific experimental conditions in order to minimize the 
homogeneity between and within data sets. Data set IDs are given 
together with reference gene lists in Supporting Information SI. 
Meta-analysis and consequent classification of tissue- or cell type- 
origin specificity of housekeeping genes appears to be the best 
approach to determine the most appropriate reference genes 
among the large number of known housekeeping genes. Further- 



more identification of housekeeping genes enables normalization 
of transcriptome data not only for microarrays but also for RNA- 
seq experiments [33,34]. RNA-seq technology is advantageous for 
detecting genes that are expressed in low levels. Hence, similar 
studies with RNA-seq data may increase the reliability of 
housekeeping genes when large number of NGS data are available 
[14] and can be used to identify a universal reference gene set. The 
tissue specific reference gene lists presented in this study, provide 
housekeeping genes that can be exploited as references in 
differential expression analysis of data from variety of transcrip- 
tome and RT-qPCR experiments. 

Methods 

In order to normalize expression values, percentile ranking have 
been applied. For each gene G, in a sample a single rank value, 
T{G,.ej) was computed (Equation 1). For a gene that was covered by 
multiple probesets in a sample, the average gene probeset rank 
value was used [23]. 



r(Gi,ej) = 



I { Gk\GkeG A ex{Gk,ej) < ex(G,-,e;) } | 
\G\ 



xlOO (1) 



where ex(g,e) is a function, which gives the mean normalized 
expression value of gene g in sample e and G is the set of all genes. 

Computation of coefficient of variation value for each gene in 
each GEO dataset and identification of candidate reference genes 

Let G = {G] G,„} be the set of genes and S = {«; «„} be a 

GEO dataset of n samples. For gene G, in experiment ej, a single 
rank value, r{G,.ej), was computed. Given a set S with n 
experiments, we had at most n rank values for a gene. The 
average amount of change in the rank of gene G„ in set S was then 
computed by the Coefficient of Variation CV(5'.G,) as given below 
in Equation 2. 



CV(S,Gi)-- 



KS,Gd 



(2) 
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where 



E r{G,,ej) 

n{S,Gd=— ,'iejsS 

n 



and 



a(5,G,) = 



\ 



- J2 {'iGi,ej) - fiiS,Gt)f,yejeS 



(3) 



(4) 



In this analysis, we computed a coefficient of variation value for 
each gene G; in each GEO dataset S. 

Next, we identified candidate reference genes with coefEcient of 
variation values below a predefined threshold, t, observed in as 
many experiments as possible. In order to account for platform 
related differences, for each gene Gj and threshold t, we computed 
a ratio. Ratioj(G,) as follows: 



RatioKG,) = 



\{CV{S,Gd\CV{S,Gi)<t}\ 
\{CV{S,Gi)\CV{S,Gi)>0}\ 



(5) 



Calculation of percentage of occurrence 

Let ^oW be the function that gives the number of genes with 
Ratiot greater than a given r and occur in at least in P0% of the 
datasets (Equation 6). 



fpoir) = \{Gi\Ratio,iGi)<r}\ 



(6) 



We plotted and compared the graph of^)o(!), varying r from 0 
to 1, for housekeeping genes and randomly selected non- 
housekeeping genes. 

Calculation of specificity and sensitivity 

We use the housekeeping gene set published by Eisenberg et al. 
[22] as the ground truth set of housekeeping genes. All the 
remaining genes are assumed to be non-housekeeping genes for 
the specificity and sensitivity analysis. The simple threshold 
classifier we use simply classifies all genes with a CV value below 
than a given threshold in at lea.st Ratio, of the datasets as 
housekeeping genes. All these analyses are performed at a fixed 
percentage of occurrence of 75% and with a fixed Ratio; of 0.9. 
The ground truth genes that are identified as housekeeping genes 
by this classifier are true positives (TP) whereas, the ground truth 
genes that are identified as non-housekeeping genes are false 
negatives (FN). Similarly, genes not in the ground truth set but 
identified as housekeeping genes are false p()siti\'e g(;n(;s (FP) and 
genes not in the ground truth set identified as non-housekeeping 
genes by the classifier are true negatives (TN). Using these 
definitions, sensitivity is given by TP/(TP-I-FN) and specificity is 
computed as TN/(TN+FP). 

Cell Lines 

Cell lines were obtained from the following sources and 
vahdated by STR analysis: HepG2 (ATCC HB-8065), FOCUS 
[35], Mahlavu[36], Hep3B (ATCC HB-8064), Hep3B-TR [37], 
Huh7 gCRB JCRB0403), SkHepl (ATCC HTB- 52), PLC 



(ATCC CRL-8024), MDA-MB-453 (ATCC HTB- 131), 
HCC1937 (ATCC CRL-2336), BT-20 (ATCC HTB-19), T47D 
(ATCC HTB-133), CAMA- 1 (ATCC HTB-21), HCTl 16 (ATCC 
CCL-247), HT29 (ATCC HTB-38), SW620 (ATCC CCL-227). 

RNA extraction 

RNA was extracted from 8 Hepatocellular Carcinoma cell lines 
(HepG2, FOCUS, Mahlavu, Hep3B, Hep3B-TR, Huh7, SkHepl, 
PLC), 5 Breast Carcinoma cell lines (MDA-MB453, HCC1937, 
BT20, T470, CAMA I) and 3 Colon Carcinoma ceU lines 
(HCTl 16, HT29, SW620) with NucleoSpin Total RNA Isolation 
Kit and the concentration and purity of total RNA from each cell 
line was measured by using a NanoDrop Spectrophotometer 
(NanoDrop Technologies). Reverse transcription was performed 
using RevertAid First Strand cDNA Synthesis Kit (Fermentas) 
from 2 Hg RNA with oligodT primer. 

Real-time quantitative PCR and stability analysis 

Primers for 1 7 selected housekeeping genes were designed using 
the Primer3 software (Table S5 in Supporting Information S2). A 
cDNA dilution series for each primer set in triplicate was analyzed 
to calculate efficiencies of the primers using the finear regression 
slope of the dilution series with the equation Efficiency = 
jQ(-i/si,.pe)_j^ Real -time quantitative PCR assays were performed 
in duplicate for each candidate gene using 1 |il of 1:100 diluted 
cDNA template with DyNAmo SYBR Green qPCR Kit 
(Finnzymes) on BioRad iCycler Real-Time qPCR System. The 
following program was used: Initial denaturation at 95°C 15 min 
amplification for 45 cycles (95°C 15 s followed by 57°C 30 s and 
72°C 30 s), and final extension at 72°C 10 min. Melting curx^e 
analysis was done for each run, in addition to agarose gel 
electrophoresis, to confirm the amplicon size and presence of a 
single gene-specific peak free from any primer-dimer or genomic 
DNA amplification. 

BioRad iCycler was programmed to set the Cq threshold on a 
fixed level. Cq vEilues generated by BioRad iCycler system were 
transformed into quantities (relative expression values) according 
to Vandesompele et al. [25]. Relative expression values were 
calculated with the equation: Relative Expression = Eflicien- 
cy Coefficient of Variation (CV) was calculated as the ratio 
of standard deviation to the mean relative expression. geNorm and 
NormFinder software were used for stabiUty analysis. 

Supporting Information 

Supporting Information SI Reference Gene Lists. 

(ZIP) 

Supporting Information S2 Tables S1-S5. Table SI. 

Kolmogorov-Simirnov Test results for the h\pothesis comparing 
pair-wise equivalence of the ratio distribution of housekeeping 
gene set to the ratio distribution of random sets of genes excluding 
the housekeeping genes. Table S2. Kolmogorov-Simirnov Test 
results for the hypothesis comparing pair-wise equivalence of the 
ratio distribution of random sets of genes, excluding the 
housekeeping genes. Table S3. Stability values of 17 reference 
genes calculated by NormFinder and geNorm. Table S4. Stability 
values of 13 genes, with standard deviation lower than 1.42, 
calculated by NormFinder and geNorm. Table S5. Real-time 
quantitative PCR primers. 
(PDF) 
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